Learning Articles - Data Science


NumPy Multi-Dimensional Computations

NumPy offers the capabilities to define multi-dimensional arrays, along with a large collection of mathematical operations and algorithms to work with these arrays. The core functionality is its data structure and methods for manipulation, which act as an efficient and flexible container to be utilized when analysing data sets. Various packages build on the functionality of NumPy for applications in specialized domains, such as Pandas and Statsmodels for data analysis, TensorFlow and PyTorch for deep learning, and OpenCV for image processing. These notes rely on the ideas and learnings from the respective package documentations, "Python For Data Analysis: Data Wrangling With Pandas, NumPy, And Jupyter", 3rd Edition, by Wes McKinney (creator and developer of Pandas) in 2022, and "Python Data Science Handbook: Essential Tools For Working With Data", 2nd Edition, by Jake VanderPlas in 2022.

https://numpy.org/doc/stable/ https://wesmckinney.com/book/ https://jakevdp.github.io/PythonDataScienceHandbook/

When using Python for data analysis and data science, the most common packages and libraries include NumPy as the numeric library underpinning all of the calculations; Pandas as the cornerstone of data manipulation; Matplotlib, Seaborn, and Plotly for intricate visualizations; Statsmodels for advanced statistical functions; SciPy for advanced scientific computing; Scikit-Learn as a toolkit for machine learning; and TensorFlow and PyTorch for artificial intelligence applications. For convenience, Anaconda can be used as a distribution of Python with pre-installed packages which focus on data analysis and data science. In many cases, the vast and open network of packages and libraries available for Python can be leveraged depending on the requirements of projects. It should be kept in mind that the use of Python diverges from traditional tools used for data analysis, such as Microsoft Excel and Tableau, which are primarily visual through point-and-click interfaces and become difficult to use when processing very large sets of data.

https://www.python.org/ https://numpy.org/ https://github.com/numpy/numpy

Installation And Setup

NumPy (short for "Numerical Python") has been the fundamental package for scientific computing with Python. This is achieved through efficient multi-dimensional arrays, numerical computing tools and arithmetic with element-wise considerations, advanced mathematical operations and algorithms, flexibility, ease of use with high-level and low-level syntax, interoperable support for a wide range of hardware and computing platforms, performant optimization with the core code written in C and Fortran, and leverage of linear algebra packages (BLAS and LAPACK). NumPy also features an API allowing for other packages (and even native C or C++) to use it as a container and access its data structures and computational facilities (useful for wrapping legacy codebases). Through growth and open accessibility, NumPy forms an integral tool for quantum computing, statistical computing, signal processing, image processing, graphs and networks, astronomy processes, cognitive psychology, bioinformatics, mathematics, chemistry, geoscience, geographic processing, architecture, engineering, and machine learning.

The only prerequisite to install NumPy is Python. NumPy can usually be installed through a package manager, as conventionally performed using Pip, or, alternatively, through the native package manager of a Linux distribution (although this version may be outdated or may not be officially maintained). For advanced developers, NumPy can be built and installed from its source code with control over the options for compiling. Once installed, NumPy can be imported into a project.

Install or upgrade NumPy using Pip from the Python Package Index (PyPI) or Conda from Anaconda:

pip install numpy

pip install --upgrade numpy

conda install numpy

conda update numpy

Import NumPy in a script to be used for a project (with conventional shorthand):

import numpy

import numpy as np

Modify the configuration of NumPy with commonly used ...settings... :

...

numpy.set_printoptions (precision = 4, suppress = True)

...

...
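As a minimal sketch of how the print options behave, the example below applies the same precision and suppress settings through the printoptions context manager, so that the global configuration is left untouched. The sample values are arbitrary and only chosen for illustration.

```python
import numpy as np

# Arbitrary sample values: one with many decimals, one very small, one large.
values = np.array([1.23456789, 0.0000000001, 1234.5])

# Apply the settings only inside the block; the global options are restored afterwards.
with np.printoptions(precision=4, suppress=True):
    formatted = str(values)

# With suppress=True, small values are shown in fixed notation instead of scientific notation.
print(formatted)
```

Using set_printoptions instead would apply the same settings for the rest of the session.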

Array ...And Scalar... Creation

An array can be created in multiple dimensions and, conventionally, forms a scalar with 0 dimensions, vector with 1 dimension (as either a row or column (although technically there is no distinction)), matrix with 2 dimensions (as a collection of rows and columns), and tensor with 3 (or more) dimensions (as a collection of layers or pages, rows, and columns). Technically and semantically, dimensions are usually referred to as axes. The array is the fundamental data structure of the library, but it should only contain data types which are homogeneous (such that every item takes up the same size block of memory and all of the blocks are interpreted in the same way based on the data type), otherwise the mathematical operations performed may be extremely inefficient. In order to create an array, it is necessary to pass an object to the array function, where this object is usually a number for a scalar, list for a vector, list of lists for a matrix, or list of lists of lists for a tensor with the dimensions respectively inferred as layers, rows, and columns. It is also possible to create an array which is automatically populated with 0, 1, a range of numbers, or uninitialized (arbitrary) numbers. Considering the difference between views (same data buffer in memory) and copies (duplicated data buffer in memory), different arrays can also share the same data, so that changes to one array are directly reflected in the other array.

Scalar.

https://numpy.org/doc/stable/reference/arrays.ndarray.html https://numpy.org/doc/stable/reference/arrays.scalars.html

Conceptual diagram showing the relationship between an array, data type, and array-scalar objects:

Hierarchy of the type objects which represent the data type of arrays:

https://numpy.org/doc/stable/_images/dtype-hierarchy.png

These arrays are able to be efficient and generic containers, as NumPy internally stores data in a contiguous block of memory which is independent of other built-in objects. The library of operations and algorithms can then work with this memory without any type checking or other overheads. This mechanism also allows for complex computations to be performed on the entire array without the need for loops. As mentioned, these operations and algorithms are written in C and Fortran which allows them to execute quickly and robustly (generally 10 to 100 or more times faster relative to regular Python, while using significantly less memory). The relevant data types for arrays include int8, uint8, int16, uint16, int32, uint32, int64, uint64, float16 (half precision), float32 (single precision), float64 (double precision), float128 (extended precision, only available on some platforms), complex64, complex128, complex256 (only available on some platforms), bool, and object.
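The effect of a homogeneous data type can be sketched with a few small arrays. The values below are arbitrary, and the data type is set explicitly for the integers because the default integer size depends on the platform.

```python
import numpy as np

# Explicit 32-bit integers: every element occupies 4 bytes.
integers = np.array([1, 2, 3], dtype=np.int32)
print(integers.dtype, integers.itemsize, integers.nbytes)  # int32 4 12

# Mixing integers and floats upcasts every element to a common data type.
mixed = np.array([1, 2.5])
print(mixed.dtype)  # float64

# Convert to another data type explicitly (this always creates a new array).
converted = integers.astype(np.float64)

# Operate on the whole array at once instead of writing a Python loop.
squares = integers ** 2
print(squares)  # [1 4 9]
```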

https://numpy.org/doc/stable/reference/routines.array-creation.html

Create an array with the given values for the layers, rows, and columns:

numpy.array (object, dtype = None, *, copy = True, order = "K", subok = False, ndmin = 0, like = None)
			object = [D0, D1, D2, D3, D4, D5, D6, D7, D8, D9]
			object = [[R0C0, R0C1, R0C2, R0C3, R0C4, R0C5, R0C6, R0C7]]
			object = [[R0C0], [R1C0], [R2C0], [R3C0], [R4C0], [R5C0], [R6C0], [R7C0]]
			object = [[R0C0, R0C1, R0C2], [R1C0, R1C1, R1C2], [R2C0, R2C1, R2C2]]
			object = [[[R0C0L0, R0C1L0], [R1C0L0, R1C1L0]], [[R0C0L1, R0C1L1], [R1C0L1, R1C1L1]]]
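As a concrete sketch of the patterns above, the following creates a scalar, vector, matrix, and tensor from a number and nested lists, and then prints the inferred number of dimensions and shape. The values are placeholders chosen for illustration.

```python
import numpy as np

scalar = np.array(5)                                     # 0 dimensions
vector = np.array([0, 1, 2, 3])                          # 1 dimension
matrix = np.array([[1, 2, 3], [4, 5, 6]])                # 2 rows, 3 columns
tensor = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])  # 2 layers, 2 rows, 2 columns

for name, array in [("scalar", scalar), ("vector", vector),
                    ("matrix", matrix), ("tensor", tensor)]:
    # The shape is inferred from the nesting of the input object.
    print(name, array.ndim, array.shape)
```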

Create an array of a given shape and type filled with a specific number for each element:

numpy.zeros (shape, dtype = float, order = "C", *, like = None)

numpy.ones (shape, dtype = None, order = "C", *, like = None)

numpy.identity (count_rows_columns, dtype = None, *, like = None)

numpy.full (shape, fill_value, dtype = None, order = "C", *, like = None)

numpy.empty (shape, dtype = float, order = "C", *, like = None)
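A short sketch of these creation functions is given below, with arbitrary shapes and fill values. It should be kept in mind that empty only allocates memory, so its contents are arbitrary until they are overwritten.

```python
import numpy as np

zeros = np.zeros((2, 3))                # 2 rows, 3 columns of 0.0
ones = np.ones((2, 3), dtype=np.int64)  # integer ones instead of the float default
identity = np.identity(3)               # 3 by 3 with ones on the main diagonal
filled = np.full((2, 2), 7.5)           # every element set to 7.5
scratch = np.empty((2, 2))              # uninitialized: values are arbitrary

print(zeros)
print(identity)
print(filled)
```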

Create an array of a given shape and type filled with a range or grid of numbers:

numpy.arange ([start, ] stop, [step, ] dtype = None, *, like = None)

numpy.linspace (start, stop, num = 50, endpoint = True, retstep = False, dtype = None, axis = 0)

numpy.meshgrid (x, y, z, copy = True, sparse = False, indexing = "xy")
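The difference between these functions can be illustrated with a small sketch: arange takes a step and excludes the stop value, linspace takes a number of points and includes the stop value by default, and meshgrid expands coordinate vectors into coordinate matrices. The inputs are arbitrary.

```python
import numpy as np

steps = np.arange(0, 10, 2)            # [0 2 4 6 8], stop is excluded
points = np.linspace(0.0, 1.0, num=5)  # [0.   0.25 0.5  0.75 1.  ], stop is included

x = np.arange(3)  # 3 coordinates along x
y = np.arange(2)  # 2 coordinates along y

# With the default indexing="xy", both outputs have shape (len(y), len(x)).
grid_x, grid_y = np.meshgrid(x, y)
print(grid_x)  # [[0 1 2], [0 1 2]]
print(grid_y)  # [[0 0 0], [1 1 1]]
```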

The information and properties intrinsic to an array are reflected by the attributes of the array. Some common attributes include ndim as the number of dimensions (sometimes referred to as the rank), shape as the size along each dimension (layers, rows, and columns), and size as the total number of elements. ...

https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html
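A brief sketch of reading these attributes from an array is shown below, using an arbitrary shape and data type.

```python
import numpy as np

tensor = np.zeros((2, 3, 4), dtype=np.float32)

print(tensor.ndim)      # 3 dimensions
print(tensor.shape)     # (2, 3, 4) as layers, rows, and columns
print(tensor.size)      # 24 elements in total
print(tensor.dtype)     # float32
print(tensor.itemsize)  # 4 bytes per element
print(tensor.nbytes)    # 96 bytes in total
```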

An array can be indexed to create a selection by integers or slices (positional indexing), booleans (logical indexing), or a list or another array of indices (advanced indexing). Instead of applying indices recursively (indexing into layers, then indexing into rows, and then indexing into columns), it is possible to directly specify the indices in groups for layers, rows, and columns (although it can be helpful to think directly in terms of dimensions rather than layers, rows, and columns). This can also be performed with booleans for the indices, which can be useful when combined with logic functions to filter for specific elements. A useful approach is to use the newaxis object (or None) with indexing to expand the dimensions of the resulting selection by a unit-length dimension. It should be noted that the Python keywords and and or do not work with boolean arrays, so the element-wise operators & and | (or the functions logical_and and logical_or) should be used instead.
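The indexing styles described above can be sketched on a small matrix. The matrix is generated with arange purely for illustration.

```python
import numpy as np

matrix = np.arange(12).reshape(3, 4)
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

element = matrix[1, 2]   # row 1, column 2 gives 6
block = matrix[:2, 1:3]  # rows 0 to 1, columns 1 to 2

# Boolean indexing: combine conditions with & and | (not and / or).
mask = (matrix > 2) & (matrix % 2 == 0)
selected = matrix[mask]  # [ 4  6  8 10]

# Advanced indexing: pick the elements at (0, 1) and (2, 3).
picked = matrix[[0, 2], [1, 3]]  # [ 1 11]

# newaxis adds a unit-length dimension, turning a vector into a column.
column = matrix[:, 0][:, np.newaxis]  # shape (3, 1)
```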

A distinction needs to be made between basic indexing and advanced indexing. The primary difference is that basic indexing will only select a slice from an array, while advanced indexing will select an arbitrary group from an array (which allows for repetition of indices). Under basic indexing, a slice of the original array is referenced, where this slice is a view (uses the same values in memory) and any modification to the view will be reflected in the original array (a copy needs to be explicitly requested to create a new object). Under advanced indexing, a group from the original array is created, where this group is a copy and ...acts as... a new object. It should be noted that selecting data by boolean indexing and assigning the result to a variable will always create a copy of the data. In addition, the search order for indexing is row-major (fill the consecutive elements of a row before moving to subsequent rows).

https://numpy.org/doc/stable/user/basics.indexing.html https://numpy.org/doc/stable/user/basics.copies.html
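The distinction between views and copies can be checked directly. The sketch below modifies a slice and an advanced selection of the same array, and uses shares_memory to confirm which one is linked to the original data.

```python
import numpy as np

original = np.arange(6)  # [0 1 2 3 4 5]

# Basic indexing returns a view: writing to it changes the original.
view = original[1:4]
view[0] = 100
print(original)  # [  0 100   2   3   4   5]

# Advanced indexing returns a copy: writing to it leaves the original alone.
fancy = original[[1, 2, 3]]
fancy[0] = -1
print(original)  # still [  0 100   2   3   4   5]

# An explicit copy of a slice is also independent of the original.
explicit = original[1:4].copy()

print(np.shares_memory(original, view))      # True
print(np.shares_memory(original, fancy))     # False
print(np.shares_memory(original, explicit))  # False
```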

Could make image comparing orders: https://en.wikipedia.org/wiki/Row-_and_column-major_order and https://wesmckinney.com/book/numpy-basics.html#figure_ndarray_indexing.

Basic Array Manipulation

In most cases, there are equivalent general functions and methods which can be used for the manipulation of an array. The shape of an array can be changed, as long as the new shape is compatible with the original shape (consistent number of elements). Following on, transposing is a special form of reshaping, where the axes are flipped or swapped based on the permuted order. Additional values can be inserted into or appended onto an array, while other values can be deleted using the appropriate indices. Multiple arrays can also be concatenated or joined along an existing axis, where the arrays must have the same shape except in the dimension corresponding to the axis. Likewise, it is possible to stack arrays along a new axis, where the arrays must have the same shape (this can rebuild arrays which have been split). Conversely, an array can be split or divided into a list of multiple sub-arrays in sections of equal size or at specific indices, as horizontal column-wise sub-arrays, vertical row-wise sub-arrays, or depth layer-wise sub-arrays.

https://numpy.org/doc/stable/reference/routines.array-manipulation.html

Return a view with a modified shape of the axes of the input array:

numpy.reshape (array, shape, order = "C")

numpy.transpose (array, axes = None)

numpy.swapaxes (array, axis_0, axis_1)

numpy.flip (array, axis = None)

numpy.fliplr (array)

numpy.flipud (array)
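A small sketch of these functions on a 2 by 3 matrix is given below. The final check illustrates that the transposed result is a view onto the same data rather than a copy.

```python
import numpy as np

matrix = np.arange(6).reshape(2, 3)
# [[0 1 2]
#  [3 4 5]]

transposed = np.transpose(matrix)    # shape (3, 2)
swapped = np.swapaxes(matrix, 0, 1)  # same result as the transpose for 2 dimensions
flipped_lr = np.fliplr(matrix)       # [[2 1 0], [5 4 3]]
flipped_ud = np.flipud(matrix)       # [[3 4 5], [0 1 2]]
flattened = np.reshape(matrix, (6,)) # [0 1 2 3 4 5]

# The transpose reuses the memory of the input array.
print(np.shares_memory(matrix, transposed))  # True
```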

Return a copy with values added to or removed from the input arrays:

numpy.insert (array, indices, values, axis = None)

numpy.append (array, values, axis = None)

numpy.delete (array, indices, axis = None)
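As a short sketch with arbitrary values, each of these functions returns a new array and leaves the input array unchanged.

```python
import numpy as np

vector = np.array([10, 20, 30])

inserted = np.insert(vector, 1, 15)     # [10 15 20 30]
appended = np.append(vector, [40, 50])  # [10 20 30 40 50]
deleted = np.delete(vector, 0)          # [20 30]

matrix = np.array([[1, 2], [3, 4]])

# With an axis, the appended values must match the shape of the other dimensions.
with_row = np.append(matrix, [[5, 6]], axis=0)  # shape (3, 2)
without_column = np.delete(matrix, 1, axis=1)   # [[1], [3]]

print(vector)  # the input is unchanged: [10 20 30]
```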

Return a copy with the input arrays joined along a specific existing axis:

numpy.concatenate ((array_1, array_2, array_3), axis = 0, out = None, dtype = None, casting = "same_kind")

Return a copy with the input arrays joined along a specific new axis:

numpy.stack ((array_1, array_2, array_3), axis = 0, out = None, *, dtype = None, casting = "same_kind")

numpy.hstack ((array_1, array_2, array_3), *, dtype = None, casting = "same_kind")

numpy.vstack ((array_1, array_2, array_3), *, dtype = None, casting = "same_kind")

numpy.dstack ((array_1, array_2, array_3))
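The effect of the axis on the resulting shape can be sketched with two small matrices of arbitrary values.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

rows = np.concatenate((a, b), axis=0)     # existing axis 0: shape (4, 2)
columns = np.concatenate((a, b), axis=1)  # existing axis 1: shape (2, 4)
stacked = np.stack((a, b), axis=0)        # new leading axis: shape (2, 2, 2)

print(np.vstack((a, b)).shape)  # (4, 2)
print(np.hstack((a, b)).shape)  # (2, 4)
print(np.dstack((a, b)).shape)  # (2, 2, 2)

# The joined result does not share memory with the inputs.
print(np.shares_memory(rows, a))  # False
```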

Return a list of views with the input array split into sections of equal size or at specific indices along a specific axis:

numpy.split (array, sections_indices, axis = 0)

numpy.hsplit (array, sections_indices)

numpy.vsplit (array, sections_indices)

numpy.dsplit (array, sections_indices)
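A brief sketch of splitting is shown below, where an integer requests sections of equal size and a list requests splits at specific indices. The matrix is generated only for illustration.

```python
import numpy as np

matrix = np.arange(16).reshape(4, 4)

# An integer splits into that many equal sections.
top, bottom = np.split(matrix, 2, axis=0)  # two arrays of shape (2, 4)

# A list splits before the given indices: columns [0], [1, 2], and [3].
first, middle, last = np.split(matrix, [1, 3], axis=1)

left, right = np.hsplit(matrix, 2)   # column-wise: two arrays of shape (4, 2)
upper, lower = np.vsplit(matrix, 2)  # row-wise: two arrays of shape (2, 4)

# Joining the sections again rebuilds the original array.
rebuilt = np.concatenate((top, bottom), axis=0)
```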

Calculations And Operations

An arithmetic operation can be performed in an element-wise or matrix-wise manner. Element-wise operations can be thought of as batch operations, where the operation is applied to each element of the array (often referred to as vectorization). Matrix-wise operations usually involve linear algebra, such as matrix multiplication, conjugation, inversion, decompositions, determinants, or eigenvalues. Similarly, logic functions can be used to evaluate an array with the results given as a boolean in an element-wise or matrix-wise manner, where common logic functions include evaluating whether arrays are greater than, less than, or equal to each other (and their variants). If the arrays are not the same shape, they must be broadcastable to a common shape along the dimensions. It should be noted that, due to this broadcasting, whenever an operation involves an array and a scalar, an element-wise operation will be performed, where the scalar is applied to each element of the array based on the operation.

https://numpy.org/doc/stable/reference/routines.math.html https://numpy.org/doc/stable/reference/routines.linalg.html https://numpy.org/doc/stable/reference/routines.logic.html

Return an array of the element-wise sum, difference, product, or quotient between input arrays:

numpy.add (array_augend, array_addend, /, out = None, *, where = True, casting = "same_kind", order = "K", dtype = None, subok = True [, signature, extobj])

numpy.subtract (array_minuend, array_subtrahend, /, out = None, *, where = True, casting = "same_kind", order = "K", dtype = None, subok = True [, signature, extobj])

numpy.multiply (array_multiplicand, array_multiplier, /, out = None, *, where = True, casting = "same_kind", order = "K", dtype = None, subok = True [, signature, extobj])

numpy.divide (array_dividend, array_divisor, /, out = None, *, where = True, casting = "same_kind", order = "K", dtype = None, subok = True [, signature, extobj])

numpy.mod (array_dividend, array_divisor, /, out = None, *, where = True, casting = "same_kind", order = "K", dtype = None, subok = True [, signature, extobj])

numpy.power (array_base, array_exponent, /, out = None, *, where = True, casting = "same_kind", order = "K", dtype = None, subok = True [, signature, extobj])
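The element-wise functions and broadcasting can be sketched with a matrix, a row vector, a column vector, and a scalar. The values are arbitrary, and each function has an equivalent operator (+, -, *, /, %, and **).

```python
import numpy as np

matrix = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])
row = np.array([10.0, 20.0, 30.0])  # shape (3,) is broadcast across both rows
column = np.array([[1.0], [2.0]])   # shape (2, 1) is broadcast across all columns

total = np.add(matrix, row)          # [[11 22 33], [14 25 36]], same as matrix + row
scaled = np.multiply(matrix, 2)      # the scalar is applied to every element
quotient = np.divide(matrix, column) # [[1 2 3], [2 2.5 3]]

remainder = np.mod(np.array([7, 8, 9]), 4)  # [3 0 1]
powers = np.power(np.array([1, 2, 3]), 2)   # [1 4 9]
```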

Return an array of the matrix-wise product and other operations between input arrays:

numpy.matmul (array_0, array_1, /, out = None, *, casting = "same_kind", order = "K", dtype = None, subok = True [, signature, extobj, axes, axis])

numpy.dot (array_0, array_1, out = None)

numpy.linalg.inv (array)

numpy.linalg.det (array)

numpy.linalg.eigvals (array)

numpy.linalg.solve (array_a, array_b)
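A minimal sketch of the matrix-wise functions is given below for a small, arbitrarily chosen system of linear equations (2x + y = 3 and x + 3y = 5). The results are floating point, so comparisons should use a tolerance.

```python
import numpy as np

a = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

product = np.matmul(a, a)             # same as a @ a: [[5 5], [5 10]]
determinant = np.linalg.det(a)        # 2 * 3 - 1 * 1 = 5
inverse = np.linalg.inv(a)
eigenvalues = np.linalg.eigvals(a)
solution = np.linalg.solve(a, b)      # x = 0.8, y = 1.4

# The inverse multiplied by the original matrix gives the identity (within tolerance).
print(np.allclose(inverse @ a, np.identity(2)))  # True
```

For solving a system, solve is generally preferred over multiplying by the explicit inverse.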

Return an array of the element-wise boolean result from a logical comparison (invert with ~):

numpy.greater (array_0, array_1, /, out = None, *, where = True, casting = "same_kind", order = "K", dtype = None, subok = True [, signature, extobj])

numpy.greater_equal (array_0, array_1, /, out = None, *, where = True, casting = "same_kind", order = "K", dtype = None, subok = True [, signature, extobj])

numpy.less (array_0, array_1, /, out = None, *, where = True, casting = "same_kind", order = "K", dtype = None, subok = True [, signature, extobj])

numpy.less_equal (array_0, array_1, /, out = None, *, where = True, casting = "same_kind", order = "K", dtype = None, subok = True [, signature, extobj])

numpy.equal (array_0, array_1, /, out = None, *, where = True, casting = "same_kind", order = "K", dtype = None, subok = True [, signature, extobj])

numpy.not_equal (array_0, array_1, /, out = None, *, where = True, casting = "same_kind", order = "K", dtype = None, subok = True [, signature, extobj])

Return an array of the matrix-wise boolean result from a logical comparison (invert with ~):

numpy.all (array, axis = None, out = None, keepdims = <no value>, *, where = <no value>)

numpy.any (array, axis = None, out = None, keepdims = <no value>, *, where = <no value>)

numpy.array_equal (array_0, array_1, equal_nan = False)
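The difference between the element-wise comparisons and the reductions over a whole array (or along an axis) can be sketched as follows, using arbitrary values.

```python
import numpy as np

a = np.array([1, 5, 3])
b = np.array([2, 5, 1])

greater = np.greater(a, b)  # same as a > b: [False False  True]
equal = np.equal(a, b)      # same as a == b: [False  True False]

every_positive = np.all(a > 0)  # True: every element passes the test
any_greater = np.any(a > b)     # True: at least one element passes the test
identical = np.array_equal(a, b)  # False: a single boolean for the whole arrays

matrix = np.array([[1, -2],
                   [3, 4]])
# Reduce along axis 0 to get one result per column.
columns_positive = np.all(matrix > 0, axis=0)  # [ True False]
```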

Return an array of the boolean result from a logical combination (invert with ~):

numpy.logical_not (array, /, out = None, *, where = True, casting = "same_kind", order = "K", dtype = None, subok = True [, signature, extobj])

numpy.logical_and (array_0, array_1, /, out = None, *, where = True, casting = "same_kind", order = "K", dtype = None, subok = True [, signature, extobj])

numpy.logical_or (array_0, array_1, /, out = None, *, where = True, casting = "same_kind", order = "K", dtype = None, subok = True [, signature, extobj])

numpy.logical_xor (array_0, array_1, /, out = None, *, where = True, casting = "same_kind", order = "K", dtype = None, subok = True [, signature, extobj])
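As a short sketch, the logical functions combine boolean arrays element by element, which is what the operators &, |, ^, and ~ do for boolean arrays as well. The values and thresholds are arbitrary.

```python
import numpy as np

values = np.array([1, 4, 7, 10])

in_range = np.logical_and(values > 2, values < 9)   # [False  True  True False]
outside = np.logical_not(in_range)                  # same as ~in_range
extremes = np.logical_or(values < 2, values > 9)    # [ True False False  True]
exclusive = np.logical_xor(values > 2, values > 5)  # [False  True False False]

# The boolean result can be used directly as a filter.
print(values[in_range])  # [4 7]
```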

It should be noted that these functions which perform element-wise operations on data in an array are known as universal functions. These can be thought of as vectorized wrappers for simple functions working in an element-by-element fashion, supporting broadcasting, and supporting type casting (unary universal functions act with 1 array, while binary universal functions act with 2 arrays). The direct advantage of universal functions is the ability to replace explicit loops with simple array expressions which are faster and more efficient (vectorization). In combination with ...regular... functions, many calculations can be then performed for statistics (mean, median, standard deviation, etc), sorting (numerical, alphabetical, ascending, descending, etc), and sets (unique, ..., etc).

https://numpy.org/doc/stable/reference/ufuncs.html https://numpy.org/doc/stable/reference/routines.statistics.html https://numpy.org/doc/stable/reference/routines.sort.html https://numpy.org/doc/stable/reference/routines.set.html

Examples of common universal and ...regular... functions used for statistics:

...

...

Examples of common universal and ...regular... functions used for sorting:

...

...

Examples of common universal and ...regular... functions used for sets:

...

...
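Without claiming to be the intended list of examples, a few widely used functions for statistics, sorting, and sets can be sketched on a small array of arbitrary values.

```python
import numpy as np

data = np.array([3, 1, 2, 3, 5, 1])

# Statistics.
mean = np.mean(data)      # 2.5
median = np.median(data)  # 2.5
spread = np.std(data)     # population standard deviation

# Sorting.
ascending = np.sort(data)         # [1 1 2 3 3 5]
descending = np.sort(data)[::-1]  # [5 3 3 2 1 1]
order = np.argsort(data)          # indices which would sort the array

# Sets.
unique, counts = np.unique(data, return_counts=True)  # [1 2 3 5] and [2 1 2 1]
common = np.intersect1d(data, [1, 5, 9])              # [1 5]
```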

...Constants...

There are several constants which represent ... . These include infinity as inf (with aliases of Inf, Infinity, PINF, and infty), negative infinity as NINF, not a number as nan (with aliases of NaN and NAN), positive zero as PZERO, negative zero as NZERO, Euler's number as e, Euler's constant as euler_gamma, and pi as pi. It should be noted that the aliases and the PINF, NINF, PZERO, and NZERO constants were removed in NumPy 2.0, so inf, -inf, nan, 0.0, and -0.0 should be used instead. ...

https://numpy.org/doc/stable/reference/constants.html
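A small sketch of the constants in use is shown below. Only inf, nan, e, pi, and euler_gamma are used, and the standard math module is imported purely for comparison.

```python
import math

import numpy as np

# Infinity compares greater than any finite number.
print(np.inf > 1e308)       # True
print(np.isinf(-np.inf))    # True, negative infinity is written as -np.inf

# nan is never equal to anything, including itself, so isnan is used for testing.
print(np.nan == np.nan)     # False
print(np.isnan(np.nan))     # True

# The mathematical constants are regular Python floats.
print(math.isclose(np.e, math.e))    # True
print(math.isclose(np.pi, math.pi))  # True
print(np.euler_gamma)                # approximately 0.5772
```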

Structured Array

A structured array is a data type which contains ..., where each sub-type is a field which has a name, type, and optional title.

https://numpy.org/doc/stable/user/basics.rec.html
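As a minimal sketch, a structured array can be created by passing a list of tuples together with a data type which names each field. The field names (name, age, and weight) and the records are hypothetical and only serve as an illustration.

```python
import numpy as np

# Each field has a name and a type: 10-character string, 32-bit integer, 64-bit float.
person_type = np.dtype([("name", "U10"), ("age", "i4"), ("weight", "f8")])

people = np.array([("Alice", 30, 55.5), ("Bob", 25, 72.0)], dtype=person_type)

print(people.dtype.names)  # ('name', 'age', 'weight')
print(people["age"])       # access a field across all records: [30 25]
print(people[0]["name"])   # access a field of a single record: Alice

# Fields can be combined with boolean indexing to filter records.
older = people[people["age"] > 26]["name"]
```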

Pseudo-Random Number Generation

As a supplement to the built-in random module of Python, the pseudo-random number generator allows for the creation of random samples of values from different probability distributions. Some common examples of these distributions include uniform, normal, beta, chi-square, and gamma. For the mechanisms, Generator includes algorithmic improvements and serves as a replacement for RandomState (legacy without further development). There is also functionality for performing permutations or sampling from an array. It should be noted that the results are deterministically reproducible based on the seed used for the initial state (although there is no guarantee of compatibility between versions).

https://numpy.org/doc/stable/reference/random/index.html https://numpy.org/doc/stable/reference/random/generator.html https://numpy.org/doc/stable/reference/random/legacy.html

Return a pseudo-random number generator (instance of Generator) with a modified seed and configuration:

numpy.random.default_rng (seed = 12345)

Return a random sample from the continuous uniform distribution between 0 and 1:

numpy.random.Generator.random (size = None, dtype = numpy.float64, out = None)

Return a random sample of values drawn from different probability distributions:

numpy.random.Generator.uniform (low = 0.0, high = 1.0, size = None)

numpy.random.Generator.standard_normal (size = None, dtype = numpy.float64, out = None)

numpy.random.Generator.normal (loc = 0.0, scale = 1.0, size = None)

numpy.random.Generator.integers (low, high = None, size = None, dtype = numpy.int64, endpoint = False)

numpy.random.Generator.binomial (parameter_n, parameter_p, size = None)

numpy.random.Generator.beta (alpha, beta, size = None)

numpy.random.Generator.chisquare (degrees_freedom, size = None)

numpy.random.Generator.gamma (shape, scale = 1.0, size = None)
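The methods above are called on an instance of Generator rather than on the class itself. The sketch below draws a few samples and shows that the same seed reproduces the same values (within the same version of NumPy). The seed, sizes, and parameters are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=12345)

uniform = rng.random(size=3)                          # floats in [0, 1)
normal = rng.normal(loc=0.0, scale=1.0, size=(2, 3))  # mean 0, standard deviation 1
integers = rng.integers(low=0, high=10, size=5)       # integers from 0 to 9
flips = rng.binomial(10, 0.5, size=4)                 # successes out of 10 trials

# A second generator with the same seed produces the same sequence.
rng_again = np.random.default_rng(seed=12345)
print(np.array_equal(uniform, rng_again.random(size=3)))  # True
```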

Shuffle the order in-place or return a permuted array from the input array:

numpy.random.Generator.shuffle (array, axis = 0)

numpy.random.Generator.permuted (array, axis = None, out = None)

numpy.random.Generator.permutation (array, axis = 0)

Return a random sample from an array (setting probabilities uses a general sampler instead of the default):

numpy.random.Generator.choice (array, size = None, replace = True, p = None, axis = 0, shuffle = True)
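The difference between shuffling in-place, returning a permuted array, and sampling can be sketched as follows. The seed, arrays, and probabilities are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=12345)
deck = np.arange(10)

# permutation returns a new array and leaves the input unchanged.
permuted = rng.permutation(deck)
input_unchanged = np.array_equal(deck, np.arange(10))

# shuffle reorders the input array in-place and returns None.
rng.shuffle(deck)

# Sample 2 distinct values without replacement.
sample = rng.choice(np.array([10, 20, 30, 40]), size=2, replace=False)

# Sample with replacement using explicit probabilities for each value.
weighted = rng.choice(np.array(["a", "b", "c"]), size=5, p=[0.7, 0.2, 0.1])
```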

Saving And Loading

It is possible to save and load data in binary or text formats. The default format is an uncompressed raw binary file for a single array with npy as the extension. It is also possible to save the data as an uncompressed or compressed zipped archive of multiple arrays with npz as the extension. Alternatively, the data can be saved in a common text format, where it is necessary to specify the delimiter, new line character, header, footer, comments, and encoding.

https://numpy.org/doc/stable/reference/routines.io.html https://numpy.org/doc/stable/reference/generated/numpy.lib.format.html

Save arrays as an uncompressed raw binary file, uncompressed archive, or compressed archive:

numpy.save (file_name, array, allow_pickle = True, fix_imports = True)

numpy.savez (file_name, arr_0 = array_0, arr_1 = array_1, *args, **kwds)

numpy.savez_compressed (file_name, arr_0 = array_0, arr_1 = array_1, *args, **kwds)

Load arrays from an uncompressed raw binary file, uncompressed archive, or compressed archive:

numpy.load (file_name, mmap_mode = None, allow_pickle = False, fix_imports = True, encoding = "ASCII", *, max_header_size = 10000)
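A round trip through the binary formats can be sketched with a temporary directory, so that no files are left behind. The file names and the keys first and second are hypothetical.

```python
import tempfile
from pathlib import Path

import numpy as np

array_0 = np.arange(6).reshape(2, 3)
array_1 = np.linspace(0.0, 1.0, 5)

with tempfile.TemporaryDirectory() as directory:
    # Single array in an uncompressed npy file.
    single_path = Path(directory) / "single.npy"
    np.save(single_path, array_0)
    loaded_single = np.load(single_path)

    # Multiple arrays in a compressed npz archive, stored under keyword names.
    archive_path = Path(directory) / "archive.npz"
    np.savez_compressed(archive_path, first=array_0, second=array_1)

    # The archive is loaded lazily, so it is closed with a context manager.
    with np.load(archive_path) as archive:
        keys = sorted(archive.files)
        loaded_first = archive["first"]
        loaded_second = archive["second"]
```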

Save an array to a text file, such as txt, csv, tsv, or other delimited files:

numpy.savetxt (file_name, X, fmt = "%.18e", delimiter = " ", newline = "\n", header = "", footer = "", comments = "# ", encoding = None)

Load an array from a text file, such as txt, csv, tsv, or other delimited files:

numpy.loadtxt (file_name, dtype = <class "float">, comments = "#", delimiter = None, converters = None, skiprows = 0, usecols = None, unpack = False, ndmin = 0, encoding = "bytes", max_rows = None, *, quotechar = None, like = None)
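A round trip through a delimited text file can be sketched in the same way. The file name, the column names in the header, and the format are hypothetical.

```python
import tempfile
from pathlib import Path

import numpy as np

table = np.array([[1.0, 2.5],
                  [3.0, 4.5]])

with tempfile.TemporaryDirectory() as directory:
    path = Path(directory) / "table.csv"

    # The header is written with the comments prefix ("# " by default).
    np.savetxt(path, table, fmt="%.3f", delimiter=",", header="a,b")
    text = path.read_text()

    # Lines starting with the comments character ("#" by default) are skipped.
    loaded = np.loadtxt(path, delimiter=",")

print(text)
```

The fmt argument controls the precision on disk, so values may be rounded when they are loaded again.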